Tutorial for gwaslab¶
- In this tutorial, we will perform a common pipeline for sumstats QC, standardization and harmonization.
- We will also show examples of visualization.
- Using a jupyter notebook, we first download sample data.
Download sample data¶
The data we will use as an example: sumstats of type 2 diabetes from BBJ
!wget -O t2d_bbj.txt.gz http://jenger.riken.jp/14/
--2022-10-21 15:41:14-- http://jenger.riken.jp/14/ Resolving jenger.riken.jp (jenger.riken.jp)... 134.160.84.25 Connecting to jenger.riken.jp (jenger.riken.jp)|134.160.84.25|:80... connected. HTTP request sent, awaiting response... 200 OK Length: 274187574 (261M) [text/plain] Saving to: ‘t2d_bbj.txt.gz’ t2d_bbj.txt.gz 100%[===================>] 261.49M 38.0MB/s in 7.3s 2022-10-21 15:41:22 (36.0 MB/s) - ‘t2d_bbj.txt.gz’ saved [274187574/274187574]
Import gwaslab package¶
If you download it from pip, simply run:
import gwaslab as gl
Or if you want to use the latest version from github (beta version) you can clone the repo and import like:
import sys
sys.path.insert(0,"/home/yunye/gwaslab/gwaslab/src")
import gwaslab as gl
Loading data into gwaslab Sumstats¶
Let's import this raw sumstats into the gwaslab Sumstats Object by specifying the necessary columns.
Note: you can either specify eaf (effect allele frequency) or neaf (non-effect allele frequency), if neaf is specified, it will be converted to eaf when loading sumstats.
mysumstats = gl.Sumstats("t2d_bbj.txt.gz",
snpid="SNP",
chrom="CHR",
pos="POS",
ea="ALT",
nea="REF",
neaf="Frq",
beta="BETA",
se="SE",
p="P",
direction="Dir",
n="N")
Sat Nov 5 20:39:31 2022 Start to initiate from file :t2d_bbj.txt.gz Sat Nov 5 20:39:49 2022 -Reading columns : REF,SE,Dir,SNP,POS,P,BETA,CHR,ALT,N,Frq Sat Nov 5 20:39:49 2022 -Renaming columns to : NEA,SE,DIRECTION,SNPID,POS,P,BETA,CHR,EA,N,EAF Sat Nov 5 20:39:49 2022 -Current Dataframe shape : 12557761 x 11 Sat Nov 5 20:39:51 2022 -Initiating a status column: STATUS ... Sat Nov 5 20:39:56 2022 -NEAF is specified... Sat Nov 5 20:39:56 2022 -Checking if 0<= NEAF <=1 ... Sat Nov 5 20:39:58 2022 -Converted NEAF to EAF. Sat Nov 5 20:39:58 2022 -Removed 0 variants with bad NEAF. Sat Nov 5 20:39:58 2022 Start to reorder the columns... Sat Nov 5 20:39:58 2022 -Current Dataframe shape : 12557761 x 12 Sat Nov 5 20:39:58 2022 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,P,N,DIRECTION,STATUS Sat Nov 5 20:39:59 2022 Finished sorting columns successfully! Sat Nov 5 20:39:59 2022 Finished loading data successfully!
Sumstats are stored in Sumstats.data as apandas.DataFrame, you can check the data by:
mysumstats.data
| SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | P | N | DIRECTION | STATUS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1:725932_G_A | 1 | 725932 | G | A | 0.9960 | -0.0737 | 0.1394 | 0.5970 | 166718 | -?+- | 9999999 |
| 1 | 1:725933_A_G | 1 | 725933 | G | A | 0.0040 | 0.0737 | 0.1394 | 0.5973 | 166718 | +?-+ | 9999999 |
| 2 | 1:737801_T_C | 1 | 737801 | C | T | 0.0051 | 0.0490 | 0.1231 | 0.6908 | 166718 | +?-+ | 9999999 |
| 3 | 1:749963_T_TAA | 1 | 749963 | TAA | T | 0.8374 | 0.0213 | 0.0199 | 0.2846 | 166718 | -?++ | 9999999 |
| 4 | 1:751343_T_A | 1 | 751343 | T | A | 0.8593 | 0.0172 | 0.0156 | 0.2705 | 166718 | -?++ | 9999999 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12557756 | X:154874837_A_G | X | 154874837 | G | A | 0.7478 | -0.0064 | 0.0117 | 0.5840 | 191764 | -+-+ | 9999999 |
| 12557757 | X:154875192_GTACTC_G | X | 154875192 | GTACTC | G | 0.2525 | 0.0071 | 0.0122 | 0.5612 | 191764 | +-+- | 9999999 |
| 12557758 | X:154879115_A_G | X | 154879115 | G | A | 0.7463 | -0.0070 | 0.0122 | 0.5646 | 191764 | -+-+ | 9999999 |
| 12557759 | X:154880669_T_A | X | 154880669 | T | A | 0.2558 | 0.0071 | 0.0122 | 0.5618 | 191764 | +-+- | 9999999 |
| 12557760 | X:154880917_C_T | X | 154880917 | C | T | 0.2558 | 0.0072 | 0.0122 | 0.5570 | 191764 | +-+- | 9999999 |
12557761 rows × 12 columns
For details on gwaslab Sumstats Object, see: https://cloufield.github.io/gwaslab/SumstatsObject/
Create Manhattan and Q-Q plot¶
Maybe the first thing you want to check is the manhattan and qq plots. gwaslab will do a minimum QC for sumstats when plotting.
For details on Manhattan and Q-Q plot, see: https://cloufield.github.io/gwaslab/Visualization/
# skip : skip variants with MLOG10P<2 for faster plotting speed
mysumstats.plot_mqq(skip=2)
Sat Nov 5 20:29:59 2022 Start to plot manhattan/qq plot with the following basic settings: Sat Nov 5 20:29:59 2022 -Genome-wide significance level is set to 5e-08 ... Sat Nov 5 20:29:59 2022 -Raw input contains 12557761 variants... Sat Nov 5 20:29:59 2022 -Plot layout mode is : mqq Sat Nov 5 20:30:07 2022 Finished loading specified columns from the sumstats. Sat Nov 5 20:30:07 2022 Start conversion and sanity check: Sat Nov 5 20:30:08 2022 -Removed 328791 variants with nan in CHR or POS column ... Sat Nov 5 20:30:08 2022 -Removed 0 variants with nan in P column ... Sat Nov 5 20:30:08 2022 -P values are being converted to -log10(P)... Sat Nov 5 20:30:08 2022 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed... Sat Nov 5 20:30:10 2022 -Sanity check: 0 na/inf/-inf variants will be removed... Sat Nov 5 20:30:12 2022 -Maximum -log10(P) values is 167.58838029403677 . Sat Nov 5 20:30:12 2022 Finished data conversion and sanity check. Sat Nov 5 20:30:12 2022 Start to create manhattan plot with 321353 variants: Sat Nov 5 20:30:13 2022 -Found 84 significant variants with a sliding window size of 500 kb... Sat Nov 5 20:30:13 2022 Finished creating Manhattan plot successfully! Sat Nov 5 20:30:13 2022 -Skip annotating Sat Nov 5 20:30:13 2022 Start to create QQ plot with 321353 variants: Sat Nov 5 20:30:14 2022 -Calculating GC using P : 1.2139048028292598 Sat Nov 5 20:30:14 2022 Finished creating QQ plot successfully!
(<Figure size 3000x1000 with 2 Axes>, <gwaslab.Log.Log at 0x7f6a206fcbe0>)
Standardization & QC : basic_check()¶
It is needed to check SNP ID, rsID, chromosome, basepair position, alleles and statistics first before any manipulations (sometimes you need do this before plotting if the sumstats is in a really messy format):
#check SNPID,rsID,CHR,POS,EA, NEA and statistics
mysumstats.filter_value('CHR=="7"')
mysumstats.basic_check()
Sat Nov 5 20:40:24 2022 Start filtering values by condition: CHR=="7" Sat Nov 5 20:40:25 2022 -Removing 11849981 variants not meeting the conditions: CHR=="7" Sat Nov 5 20:40:25 2022 Finished filtering values. Sat Nov 5 20:40:25 2022 Start to check IDs... Sat Nov 5 20:40:25 2022 -Current Dataframe shape : 12557761 x 12 Sat Nov 5 20:40:25 2022 -Checking if SNPID is chr:pos:ref:alt...(separator: - ,: , _) Sat Nov 5 20:40:38 2022 Finished checking IDs successfully! Sat Nov 5 20:40:38 2022 Start to fix chromosome notation... Sat Nov 5 20:40:38 2022 -Current Dataframe shape : 12557761 x 12 Sat Nov 5 20:41:12 2022 -Vairants with fixable chromosome notations: 328791 Sat Nov 5 20:41:14 2022 -Converting to string datatype and UPPERCASE... Sat Nov 5 20:41:15 2022 -Stripping chr prefix if exists : CHR_-.0... Sat Nov 5 20:41:17 2022 -Identified 328791 variants on sex chromosomes... Sat Nov 5 20:41:17 2022 -Standardizing sex chromosome notations: X Y MT to 23,24,25... Sat Nov 5 20:41:32 2022 -No unrecognized chromosome notations... Sat Nov 5 20:41:39 2022 Finished fixing chromosome notation successfully! Sat Nov 5 20:41:39 2022 Start to fix basepair positions... Sat Nov 5 20:41:39 2022 -Current Dataframe shape : 12557761 x 12 Sat Nov 5 20:41:55 2022 -Position upper_bound is: 250,000,000 Sat Nov 5 20:42:24 2022 -Remove outliers: 0 Sat Nov 5 20:42:27 2022 -Converted all position to datatype Int64. Sat Nov 5 20:42:27 2022 Finished fixing basepair position successfully! Sat Nov 5 20:42:27 2022 Start to fix alleles... Sat Nov 5 20:42:27 2022 -Current Dataframe shape : 12557761 x 12 Sat Nov 5 20:42:31 2022 -Detected 0 variants with alleles that contain bases other than A/C/T/G . Sat Nov 5 20:42:34 2022 -Converted all bases to string datatype and UPPERCASE. Sat Nov 5 20:42:52 2022 Finished fixing allele successfully! Sat Nov 5 20:42:52 2022 Start sanity check for statistics ... Sat Nov 5 20:42:52 2022 -Current Dataframe shape : 12557761 x 12 Sat Nov 5 20:42:53 2022 -Checking if 0 <=N<= inf ... Sat Nov 5 20:42:56 2022 -Removed 0 variants with bad N. Sat Nov 5 20:42:56 2022 -Checking if 0 <=EAF<= 1 ... Sat Nov 5 20:42:58 2022 -Removed 0 variants with bad EAF. Sat Nov 5 20:42:58 2022 -Checking if 5 <=MAC<= inf ... Sat Nov 5 20:43:02 2022 -Removed 0 variants with bad MAC. Sat Nov 5 20:43:02 2022 -Checking if 5e-300 <= P <= 1 ... Sat Nov 5 20:43:04 2022 -Removed 0 variants with bad P. Sat Nov 5 20:43:04 2022 -Checking if -10 <BETA)< 10 ... Sat Nov 5 20:43:06 2022 -Removed 0 variants with bad BETA. Sat Nov 5 20:43:06 2022 -Checking if 0 <SE< inf ... Sat Nov 5 20:43:08 2022 -Removed 0 variants with bad SE. Sat Nov 5 20:43:08 2022 -Checking STATUS... Sat Nov 5 20:43:09 2022 -Coverting STAUTUS to category. Sat Nov 5 20:43:11 2022 -Removed 0 variants with bad statistics in total. Sat Nov 5 20:43:11 2022 Finished sanity check successfully! Sat Nov 5 20:43:11 2022 Start to normalize variants... Sat Nov 5 20:43:11 2022 -Current Dataframe shape : 12557761 x 12 Sat Nov 5 20:43:14 2022 -Not normalized allele:['TT' 'TC']['CTTTT' 'CTTT']['ATTT' 'ATT']['ATTT' 'ATT']['GTT' 'GT']... Sat Nov 5 20:43:14 2022 -Modified 13 variants according to parsimony and left alignment principal. Sat Nov 5 20:43:16 2022 Finished normalizing variants successfully!
mysumstats.data
| SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | P | N | DIRECTION | STATUS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1:725932_G_A | 1 | 725932 | G | A | 0.9960 | -0.0737 | 0.1394 | 0.5970 | 166718 | -?+- | 9960099 |
| 1 | 1:725933_A_G | 1 | 725933 | G | A | 0.0040 | 0.0737 | 0.1394 | 0.5973 | 166718 | +?-+ | 9960099 |
| 2 | 1:737801_T_C | 1 | 737801 | C | T | 0.0051 | 0.0490 | 0.1231 | 0.6908 | 166718 | +?-+ | 9960099 |
| 3 | 1:749963_T_TAA | 1 | 749963 | TAA | T | 0.8374 | 0.0213 | 0.0199 | 0.2846 | 166718 | -?++ | 9960399 |
| 4 | 1:751343_T_A | 1 | 751343 | T | A | 0.8593 | 0.0172 | 0.0156 | 0.2705 | 166718 | -?++ | 9960099 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12557756 | X:154874837_A_G | 23 | 154874837 | G | A | 0.7478 | -0.0064 | 0.0117 | 0.5840 | 191764 | -+-+ | 9960099 |
| 12557757 | X:154875192_GTACTC_G | 23 | 154875192 | GTACTC | G | 0.2525 | 0.0071 | 0.0122 | 0.5612 | 191764 | +-+- | 9960399 |
| 12557758 | X:154879115_A_G | 23 | 154879115 | G | A | 0.7463 | -0.0070 | 0.0122 | 0.5646 | 191764 | -+-+ | 9960099 |
| 12557759 | X:154880669_T_A | 23 | 154880669 | T | A | 0.2558 | 0.0071 | 0.0122 | 0.5618 | 191764 | +-+- | 9960099 |
| 12557760 | X:154880917_C_T | 23 | 154880917 | C | T | 0.2558 | 0.0072 | 0.0122 | 0.5570 | 191764 | +-+- | 9960099 |
12557761 rows × 12 columns
By checking the log, the sumstats look good. But we found several variants that were not normalizaed and gwaslab fixed this.
.basic_check() is a wrapper of all the following basic functions, you can use these separately.
- mysumstats.fix_ID()
- mysumstats.fix_chr()
- mysumstats.fix_pos()
- mysumstats.fix_allele()
- mysumstats.check_sanity()
- mysumstats.normalize_allele()
For other options, see: https://cloufield.github.io/gwaslab/Standardization/
Extract lead variants : get_lead()¶
Let's extract the lead variants in each significant loci to check our data.
The significant loci are detected based on a sliding window (default window size: windowsizekb=500 kb)
By specifying anno=True , gwaslab will also annotate the lead variant with its nearest gene names and distance.
Note: gwaslab default genome build version is build="19" (GRCh37/hg19), you can change it to build="38" (GRCh38/hg38) when needed.
mysumstats.get_lead(anno=True)
Sat Nov 5 20:43:16 2022 Start to extract lead variants... Sat Nov 5 20:43:16 2022 -Processing 12557761 variants... Sat Nov 5 20:43:16 2022 -Significance threshold : 5e-08 Sat Nov 5 20:43:16 2022 -Sliding window size: 500 kb Sat Nov 5 20:43:20 2022 -Found 9461 significant variants in total... Sat Nov 5 20:43:21 2022 -Identified 89 lead variants! Sat Nov 5 20:43:21 2022 Start to annotate variants with nearest gene name(s)... Sat Nov 5 20:43:21 2022 -Assigning Gene name using Ensembl Release hg19 Sat Nov 5 20:43:21 2022 Config file not exists... Sat Nov 5 20:43:21 2022 Created new config file... Sat Nov 5 20:43:21 2022 -Config file does not exist. Sat Nov 5 20:43:21 2022 -Recreating configuration file. Sat Nov 5 20:43:21 2022 Start to download ensembl_hg19_gtf_protein_coding ... Sat Nov 5 20:43:21 2022 -Downloading to: /home/yunye/gwaslab/gwaslab/src/gwaslab/data/Homo_sapiens.GRCh37.75.protein_coding.gtf.gz
INFO:pyensembl.database:Creating database: /home/yunye/gwaslab/gwaslab/src/gwaslab/data/Homo_sapiens.GRCh37.75.protein_coding.gtf.db INFO:pyensembl.database:Reading GTF from /home/yunye/gwaslab/gwaslab/src/gwaslab/data/Homo_sapiens.GRCh37.75.protein_coding.gtf.gz
Sat Nov 5 20:43:31 2022 -Updating record in config file... Sat Nov 5 20:43:31 2022 Downloaded ensembl_hg19_gtf_protein_coding successfully!
/home/yunye/anaconda3/lib/python3.9/site-packages/gtfparse/read_gtf.py:82: FutureWarning: The error_bad_lines argument has been deprecated and will be removed in a future version. Use on_bad_lines in the future. chunk_iterator = pd.read_csv( /home/yunye/anaconda3/lib/python3.9/site-packages/gtfparse/read_gtf.py:82: FutureWarning: The warn_bad_lines argument has been deprecated and will be removed in a future version. Use on_bad_lines in the future. chunk_iterator = pd.read_csv( INFO:root:Extracted GTF attributes: ['gene_id', 'transcript_id', 'gene_name', 'gene_biotype', 'transcript_name', 'exon_number', 'exon_id', 'ccds_id', 'protein_id'] INFO:root:Using column 'source' to replace missing 'transcript_biotype' INFO:datacache.database_helpers:Creating database /home/yunye/gwaslab/gwaslab/src/gwaslab/data/Homo_sapiens.GRCh37.75.protein_coding.gtf.db containing: exon, transcript, start_codon, CDS, gene, stop_codon INFO:datacache.database:Running sqlite query: "CREATE TABLE exon (seqname TEXT NOT NULL, transcript_id TEXT NOT NULL, feature TEXT NOT NULL, gene_name TEXT NOT NULL, source TEXT NOT NULL, gene_biotype TEXT NOT NULL, transcript_name TEXT NOT NULL, strand TEXT NOT NULL, start INT NOT NULL, transcript_biotype TEXT NOT NULL, end INT NOT NULL, exon_number TEXT NOT NULL, gene_id TEXT NOT NULL, exon_id TEXT NOT NULL, ccds_id TEXT NOT NULL, protein_id TEXT NOT NULL)" INFO:datacache.database:Inserting 1195297 rows into table exon INFO:datacache.database:Creating index on exon (seqname, start, end) INFO:datacache.database:Creating index on exon (gene_name) INFO:datacache.database:Creating index on exon (gene_id) INFO:datacache.database:Creating index on exon (transcript_id) INFO:datacache.database:Creating index on exon (transcript_name) INFO:datacache.database:Creating index on exon (exon_id) INFO:datacache.database:Creating index on exon (protein_id) INFO:datacache.database:Creating index on exon (ccds_id) INFO:datacache.database:Running sqlite query: "CREATE TABLE transcript (seqname TEXT NOT NULL, transcript_id TEXT UNIQUE PRIMARY KEY NOT NULL, feature TEXT NOT NULL, gene_name TEXT NOT NULL, source TEXT NOT NULL, gene_biotype TEXT NOT NULL, transcript_name TEXT NOT NULL, strand TEXT NOT NULL, start INT NOT NULL, transcript_biotype TEXT NOT NULL, end INT NOT NULL, exon_number TEXT NOT NULL, gene_id TEXT NOT NULL, exon_id TEXT NOT NULL, ccds_id TEXT NOT NULL, protein_id TEXT NOT NULL)" INFO:datacache.database:Inserting 165581 rows into table transcript INFO:datacache.database:Creating index on transcript (seqname, start, end) INFO:datacache.database:Creating index on transcript (gene_name) INFO:datacache.database:Creating index on transcript (gene_id) INFO:datacache.database:Creating index on transcript (transcript_name) INFO:datacache.database:Creating index on transcript (exon_id) INFO:datacache.database:Creating index on transcript (protein_id) INFO:datacache.database:Creating index on transcript (ccds_id) INFO:datacache.database:Running sqlite query: "CREATE TABLE start_codon (seqname TEXT NOT NULL, transcript_id TEXT NOT NULL, feature TEXT NOT NULL, gene_name TEXT NOT NULL, source TEXT NOT NULL, gene_biotype TEXT NOT NULL, transcript_name TEXT NOT NULL, strand TEXT NOT NULL, start INT NOT NULL, transcript_biotype TEXT NOT NULL, end INT NOT NULL, exon_number TEXT NOT NULL, gene_id TEXT NOT NULL, exon_id TEXT NOT NULL, ccds_id TEXT NOT NULL, protein_id TEXT NOT NULL)" INFO:datacache.database:Inserting 73293 rows into table start_codon INFO:datacache.database:Creating index on start_codon (seqname, start, end) INFO:datacache.database:Creating index on start_codon (gene_name) INFO:datacache.database:Creating index on start_codon (gene_id) INFO:datacache.database:Creating index on start_codon (transcript_id) INFO:datacache.database:Creating index on start_codon (transcript_name) INFO:datacache.database:Creating index on start_codon (exon_id) INFO:datacache.database:Creating index on start_codon (protein_id) INFO:datacache.database:Creating index on start_codon (ccds_id) INFO:datacache.database:Running sqlite query: "CREATE TABLE CDS (seqname TEXT NOT NULL, transcript_id TEXT NOT NULL, feature TEXT NOT NULL, gene_name TEXT NOT NULL, source TEXT NOT NULL, gene_biotype TEXT NOT NULL, transcript_name TEXT NOT NULL, strand TEXT NOT NULL, start INT NOT NULL, transcript_biotype TEXT NOT NULL, end INT NOT NULL, exon_number TEXT NOT NULL, gene_id TEXT NOT NULL, exon_id TEXT NOT NULL, ccds_id TEXT NOT NULL, protein_id TEXT NOT NULL)" INFO:datacache.database:Inserting 791057 rows into table CDS INFO:datacache.database:Creating index on CDS (seqname, start, end) INFO:datacache.database:Creating index on CDS (gene_name) INFO:datacache.database:Creating index on CDS (gene_id) INFO:datacache.database:Creating index on CDS (transcript_id) INFO:datacache.database:Creating index on CDS (transcript_name) INFO:datacache.database:Creating index on CDS (exon_id) INFO:datacache.database:Creating index on CDS (protein_id) INFO:datacache.database:Creating index on CDS (ccds_id) INFO:datacache.database:Running sqlite query: "CREATE TABLE gene (seqname TEXT NOT NULL, transcript_id TEXT NOT NULL, feature TEXT NOT NULL, gene_name TEXT NOT NULL, source TEXT NOT NULL, gene_biotype TEXT NOT NULL, transcript_name TEXT NOT NULL, strand TEXT NOT NULL, start INT NOT NULL, transcript_biotype TEXT NOT NULL, end INT NOT NULL, exon_number TEXT NOT NULL, gene_id TEXT UNIQUE PRIMARY KEY NOT NULL, exon_id TEXT NOT NULL, ccds_id TEXT NOT NULL, protein_id TEXT NOT NULL)" INFO:datacache.database:Inserting 24159 rows into table gene INFO:datacache.database:Creating index on gene (seqname, start, end) INFO:datacache.database:Creating index on gene (gene_name) INFO:datacache.database:Creating index on gene (transcript_id) INFO:datacache.database:Creating index on gene (transcript_name) INFO:datacache.database:Creating index on gene (exon_id) INFO:datacache.database:Creating index on gene (protein_id) INFO:datacache.database:Creating index on gene (ccds_id) INFO:datacache.database:Running sqlite query: "CREATE TABLE stop_codon (seqname TEXT NOT NULL, transcript_id TEXT NOT NULL, feature TEXT NOT NULL, gene_name TEXT NOT NULL, source TEXT NOT NULL, gene_biotype TEXT NOT NULL, transcript_name TEXT NOT NULL, strand TEXT NOT NULL, start INT NOT NULL, transcript_biotype TEXT NOT NULL, end INT NOT NULL, exon_number TEXT NOT NULL, gene_id TEXT NOT NULL, exon_id TEXT NOT NULL, ccds_id TEXT NOT NULL, protein_id TEXT NOT NULL)" INFO:datacache.database:Inserting 73352 rows into table stop_codon INFO:datacache.database:Creating index on stop_codon (seqname, start, end) INFO:datacache.database:Creating index on stop_codon (gene_name) INFO:datacache.database:Creating index on stop_codon (gene_id) INFO:datacache.database:Creating index on stop_codon (transcript_id) INFO:datacache.database:Creating index on stop_codon (transcript_name) INFO:datacache.database:Creating index on stop_codon (exon_id) INFO:datacache.database:Creating index on stop_codon (protein_id) INFO:datacache.database:Creating index on stop_codon (ccds_id) INFO:datacache.database:Running sqlite query: "CREATE TABLE _datacache_metadata (version INT)" INFO:datacache.database:Running sqlite query: "INSERT INTO _datacache_metadata VALUES (3)"
Sat Nov 5 20:44:38 2022 Finished annotating variants with nearest gene name(s) successfully! Sat Nov 5 20:44:38 2022 Finished extracting lead variants successfully!
| SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | P | N | DIRECTION | STATUS | LOCATION | GENE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 96739 | 1:22068326_A_G | 1 | 22068326 | G | A | 0.7550 | 0.0621 | 0.0103 | 1.629000e-09 | 191764 | ++++ | 9960099 | 0 | USP48 |
| 213860 | 1:51103268_T_C | 1 | 51103268 | C | T | 0.7953 | -0.0802 | 0.0120 | 2.519000e-11 | 191764 | ---- | 9960099 | 0 | FAF1 |
| 534095 | 1:154309595_TA_T | 1 | 154309595 | TA | T | 0.0947 | -0.0915 | 0.0166 | 3.289000e-08 | 191764 | ---- | 9960399 | 0 | ATP8B2 |
| 969974 | 2:640986_CACAT_C | 2 | 640986 | C | CACAT | 0.9006 | -0.0946 | 0.0150 | 2.665000e-10 | 191764 | ---- | 9960399 | 26349 | TMEM18 |
| 1091807 | 2:27734972_G_A | 2 | 27734972 | G | A | 0.5605 | 0.0691 | 0.0088 | 3.897000e-15 | 191764 | ++++ | 9960099 | 0 | GCKR |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12272930 | X:21569920_A_G | 23 | 21569920 | G | A | 0.3190 | 0.0423 | 0.0076 | 2.616000e-08 | 191764 | ++++ | 9960099 | 0 | CNKSR2 |
| 12341406 | X:48724648_CAA_C | 23 | 48724648 | C | CAA | 0.6260 | -0.0602 | 0.0103 | 4.576000e-09 | 191764 | ---- | 9960399 | 26082 | TIMM17B |
| 12350767 | X:57170781_A_AT | 23 | 57170781 | AT | A | 0.3003 | -0.0447 | 0.0076 | 4.583000e-09 | 191764 | ---- | 9960399 | -6723 | SPIN2A |
| 12469290 | X:117915163_T_TA | 23 | 117915163 | TA | T | 0.5560 | 0.0548 | 0.0071 | 9.818000e-15 | 191764 | ++++ | 9960399 | 0 | IL13RA1 |
| 12554976 | X:152908887_G_A | 23 | 152908887 | G | A | 0.6792 | -0.1235 | 0.0077 | 9.197000e-58 | 191764 | ---- | 9960099 | 0 | DUSP9 |
89 rows × 14 columns
For other options, see: https://cloufield.github.io/gwaslab/ExtractLead/
Use the SNPID to create some highly customized mqq plot¶
Gwaslab can create more complicated manhattan plot.
plot and save as my_first_mqq_plot.png with {"dpi":400,"facecolor":"white"}
mysumstats.plot_mqq(mode="mqq",
cut=20,skip=3,
anno="GENENAME",
anno_set=["9:22132729_A_G","6:20688121_T_A","9:22132729_A_G","15:62394264_G_C"] ,
pinpoint=["9:22132729_A_G","5:176513896_C_A"],
highlight=["7:127253550_C_T","19:46166604_C_T"],
highlight_windowkb =1000,
stratified=True,
marker_size=(5,10),
figargs={"figsize":(15,5),"dpi":300},
save="my_first_mqq_plot.png",
saveargs={"dpi":400,"facecolor":"white"})
Sat Nov 5 20:44:38 2022 Start to plot manhattan/qq plot with the following basic settings: Sat Nov 5 20:44:38 2022 -Genome-wide significance level is set to 5e-08 ... Sat Nov 5 20:44:38 2022 -Raw input contains 12557761 variants... Sat Nov 5 20:44:38 2022 -Plot layout mode is : mqq Sat Nov 5 20:44:38 2022 -Variants to annotate : 9:22132729_A_G,6:20688121_T_A,9:22132729_A_G,15:62394264_G_C Sat Nov 5 20:44:38 2022 -Loci to highlight : 7:127253550_C_T,19:46166604_C_T Sat Nov 5 20:44:38 2022 -Highlight_window is set to: 1000 kb Sat Nov 5 20:44:38 2022 -Variants to pinpoint : 9:22132729_A_G,5:176513896_C_A Sat Nov 5 20:44:45 2022 Finished loading specified columns from the sumstats. Sat Nov 5 20:44:45 2022 Start conversion and sanity check: Sat Nov 5 20:44:47 2022 -Removed 0 variants with nan in CHR or POS column ... Sat Nov 5 20:44:48 2022 -Removed 0 variants with nan in EAF column ... Sat Nov 5 20:44:50 2022 -Removed 0 variants with nan in P column ... Sat Nov 5 20:44:50 2022 -P values are being converted to -log10(P)... Sat Nov 5 20:44:50 2022 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed... Sat Nov 5 20:44:52 2022 -Sanity check: 0 na/inf/-inf variants will be removed... Sat Nov 5 20:44:55 2022 -Maximum -log10(P) values is 167.58838029403677 . Sat Nov 5 20:44:55 2022 -Minus log10(P) values above 20 will be shrunk with a shrinkage factor of 10... Sat Nov 5 20:44:55 2022 Finished data conversion and sanity check. Sat Nov 5 20:44:55 2022 Start to create manhattan plot with 91234 variants: Sat Nov 5 20:44:55 2022 -Highlighting target loci... Sat Nov 5 20:44:55 2022 -Pinpointing target vairants... Sat Nov 5 20:44:55 2022 -Found 3 specified variants to annotate... Sat Nov 5 20:44:55 2022 Start to annotate variants with nearest gene name(s)... Sat Nov 5 20:44:55 2022 -Assigning Gene name using Ensembl Release hg19 Sat Nov 5 20:44:55 2022 Finished annotating variants with nearest gene name(s) successfully! Sat Nov 5 20:44:55 2022 Finished creating Manhattan plot successfully! Sat Nov 5 20:44:55 2022 -Annotating using column GENENAME... Sat Nov 5 20:44:55 2022 Start to create QQ plot with 91234 variants: Sat Nov 5 20:44:57 2022 -Calculating GC using P : 1.2128258399293808 Sat Nov 5 20:44:57 2022 Finished creating QQ plot successfully! Sat Nov 5 20:44:57 2022 Saving plot: Sat Nov 5 20:44:59 2022 -Saved to my_first_mqq_plot.png successfully!
(<Figure size 4500x1500 with 2 Axes>, <gwaslab.Log.Log at 0x7f985e37bd60>)
For details, see: https://cloufield.github.io/gwaslab/Visualization/
Quick regional plot without LD-information¶
Gwaslab can plot regional plot with or with out LD reference files.
For details see: https://cloufield.github.io/gwaslab/RegionalPlot/
mysumstats.plot_mqq(mode="r",region=(7,156538803,157538803),region_grid=True)
Sat Nov 5 20:45:01 2022 Start to plot manhattan/qq plot with the following basic settings: Sat Nov 5 20:45:01 2022 -Genome-wide significance level is set to 5e-08 ... Sat Nov 5 20:45:01 2022 -Raw input contains 12557761 variants... Sat Nov 5 20:45:01 2022 -Plot layout mode is : r Sat Nov 5 20:45:01 2022 -Region to plot : chr7:156538803-157538803. Sat Nov 5 20:45:03 2022 -Extract SNPs in region : chr7:156538803-157538803... Sat Nov 5 20:45:33 2022 -Extract SNPs in specified regions: 5831 Sat Nov 5 20:45:33 2022 Finished loading specified columns from the sumstats. Sat Nov 5 20:45:33 2022 Start conversion and sanity check: Sat Nov 5 20:45:33 2022 -Removed 0 variants with nan in CHR or POS column ... Sat Nov 5 20:45:33 2022 -Removed 0 variants with nan in P column ... Sat Nov 5 20:45:33 2022 -P values are being converted to -log10(P)... Sat Nov 5 20:45:33 2022 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed... Sat Nov 5 20:45:33 2022 -Sanity check: 0 na/inf/-inf variants will be removed... Sat Nov 5 20:45:33 2022 -Maximum -log10(P) values is 7.948076083953893 . Sat Nov 5 20:45:33 2022 Finished data conversion and sanity check. Sat Nov 5 20:45:33 2022 Start to create manhattan plot with 5831 variants: Sat Nov 5 20:45:33 2022 Start to download recombination_hg19 ... Sat Nov 5 20:45:33 2022 -Downloading to: /home/yunye/gwaslab/gwaslab/src/gwaslab/data/recombination/hg19/recombination_hg19.tar.gz Sat Nov 5 20:45:38 2022 -Updating record in config file... Sat Nov 5 20:45:38 2022 Downloaded recombination_hg19 successfully! Sat Nov 5 20:45:38 2022 -Loading gtf files from:defualt Sat Nov 5 20:45:38 2022 No records in config file. Please download first. Sat Nov 5 20:45:38 2022 Start to download ensembl_hg19_gtf ... Sat Nov 5 20:45:38 2022 -Downloading to: /home/yunye/gwaslab/gwaslab/src/gwaslab/data/Homo_sapiens.GRCh37.87.chr.gtf.gz Sat Nov 5 20:48:18 2022 -Updating record in config file... Sat Nov 5 20:48:18 2022 Downloaded ensembl_hg19_gtf successfully! Sat Nov 5 20:48:31 2022 -plotting gene track.. Sat Nov 5 20:48:31 2022 -Finished plotting gene track.. Sat Nov 5 20:48:31 2022 -Found 1 significant variants with a sliding window size of 500 kb... Sat Nov 5 20:48:31 2022 Finished creating Manhattan plot successfully! Sat Nov 5 20:48:31 2022 -Skip annotating
(<Figure size 3000x2000 with 3 Axes>, <gwaslab.Log.Log at 0x7f985e37bd60>)
reference download¶
Full regional plot using a user-provided vcf or preprocessed vcf: (e.g 1000 genome, see Reference: https://cloufield.github.io/gwaslab/Reference/)
check available reference¶
update the available reference list first
gl.update_available_ref()
Sat Nov 5 20:48:32 2022 Updating available_ref list from: https://raw.github.com/Cloufield/gwaslab/main/src/gwaslab/data/reference.json Sat Nov 5 20:48:33 2022 Available_ref list has been updated!
gl.check_available_ref()
Sat Nov 5 20:48:33 2022 Start to check available reference files... Sat Nov 5 20:48:33 2022 - 1kg_eas_hg19 : https://www.dropbox.com/s/lztaxqhy2o6dpxw/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?dl=1 Sat Nov 5 20:48:33 2022 - 1kg_eas_hg19_tbi : https://www.dropbox.com/s/k9klefl8m9fcfo1/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?dl=1 Sat Nov 5 20:48:33 2022 - 1kg_eur_hg19 : https://www.dropbox.com/s/1nbgqshknevseks/EUR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?dl=1 Sat Nov 5 20:48:33 2022 - 1kg_eur_hg19_tbi : https://www.dropbox.com/s/vscvkrflh6fc5a0/EUR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?dl=1 Sat Nov 5 20:48:33 2022 - 1kg_eas_hg38 : https://www.dropbox.com/s/3dstbbb1el9r3au/EAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?dl=1 Sat Nov 5 20:48:33 2022 - 1kg_eas_hg38_tbi : https://www.dropbox.com/s/vwnp5vd8dcqksn4/EAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?dl=1 Sat Nov 5 20:48:33 2022 - 1kg_eur_hg38 : https://www.dropbox.com/s/z0mkehg17lryapv/EUR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?dl=1 Sat Nov 5 20:48:33 2022 - 1kg_eur_hg38_tbi : https://www.dropbox.com/s/ze8g58x75x9qbf0/EUR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?dl=1 Sat Nov 5 20:48:33 2022 - dbsnp_v151_hg19 : https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/00-All.vcf.gz Sat Nov 5 20:48:33 2022 - dbsnp_v151_hg38 : https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/00-All.vcf.gz Sat Nov 5 20:48:33 2022 - ucsc_genome_hg19 : http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.fa.gz Sat Nov 5 20:48:33 2022 - ucsc_genome_hg38 : https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz Sat Nov 5 20:48:33 2022 - 1kg_dbsnp151_hg19_auto : https://www.dropbox.com/s/37p2u1xwmux4gwo/1kg_dbsnp151_hg19_auto.txt.gz?dl=1 Sat Nov 5 20:48:33 2022 - 1kg_dbsnp151_hg38_auto : https://www.dropbox.com/s/ouf60n7gdz6cm0g/1kg_dbsnp151_hg38_auto.txt.gz?dl=1 Sat Nov 5 20:48:33 2022 - recombination_hg19 : https://www.dropbox.com/s/4b39o8m58m5yvqx/recombination_hg19.tar.gz?dl=1 Sat Nov 5 20:48:33 2022 - ensembl_hg19_gtf : https://ftp.ensembl.org/pub/grch37/current/gtf/homo_sapiens/Homo_sapiens.GRCh37.87.chr.gtf.gz Sat Nov 5 20:48:33 2022 - ensembl_hg19_gtf_protein_coding : https://www.dropbox.com/s/ovsqf4kyeniakxf/Homo_sapiens.GRCh37.75.protein_coding.gtf.gz?dl=1 Sat Nov 5 20:48:33 2022 - ensembl_hg38_gtf : https://ftp.ensembl.org/pub/current_gtf/homo_sapiens/Homo_sapiens.GRCh38.108.chr.gtf.gz Sat Nov 5 20:48:33 2022 - ensembl_hg38_gtf_protein_coding : https://www.dropbox.com/s/37kicx0l6uifa1e/Homo_sapiens.GRCh38.107.protein_coding.chr.gtf.gz?dl=1 Sat Nov 5 20:48:33 2022 - refseq_hg19_gtf : https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_genomic.gtf.gz Sat Nov 5 20:48:33 2022 - refseq_hg19_gtf_protein_coding : https://www.dropbox.com/s/gpzflnid7zo3kca/GRCh37_latest_genomic.protein_coding.gtf.gz?dl=1 Sat Nov 5 20:48:33 2022 - refseq_hg38_gtf : https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gff.gz Sat Nov 5 20:48:33 2022 - refseq_hg38_gtf_protein_coding : https://www.dropbox.com/s/reojrsjmbwsu0pp/GRCh38_latest_genomic.protein_coding.gtf.gz?dl=1 Sat Nov 5 20:48:33 2022 - testlink : https://www.dropbox.com/s/8u7capwge0ihshu/EAS.chr22.split_norm_af.1kgp3v5.vcf.gz?dl=1 Sat Nov 5 20:48:33 2022 - testlink_tbi : https://www.dropbox.com/s/hdneg53t6u1j6ib/EAS.chr22.split_norm_af.1kgp3v5.vcf.gz.tbi?dl=1
{'1kg_eas_hg19': 'https://www.dropbox.com/s/lztaxqhy2o6dpxw/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?dl=1',
'1kg_eas_hg19_tbi': 'https://www.dropbox.com/s/k9klefl8m9fcfo1/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?dl=1',
'1kg_eur_hg19': 'https://www.dropbox.com/s/1nbgqshknevseks/EUR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?dl=1',
'1kg_eur_hg19_tbi': 'https://www.dropbox.com/s/vscvkrflh6fc5a0/EUR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?dl=1',
'1kg_eas_hg38': 'https://www.dropbox.com/s/3dstbbb1el9r3au/EAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?dl=1',
'1kg_eas_hg38_tbi': 'https://www.dropbox.com/s/vwnp5vd8dcqksn4/EAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?dl=1',
'1kg_eur_hg38': 'https://www.dropbox.com/s/z0mkehg17lryapv/EUR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?dl=1',
'1kg_eur_hg38_tbi': 'https://www.dropbox.com/s/ze8g58x75x9qbf0/EUR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?dl=1',
'dbsnp_v151_hg19': 'https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/00-All.vcf.gz',
'dbsnp_v151_hg38': 'https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/00-All.vcf.gz',
'ucsc_genome_hg19': 'http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.fa.gz',
'ucsc_genome_hg38': 'https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz',
'1kg_dbsnp151_hg19_auto': 'https://www.dropbox.com/s/37p2u1xwmux4gwo/1kg_dbsnp151_hg19_auto.txt.gz?dl=1',
'1kg_dbsnp151_hg38_auto': 'https://www.dropbox.com/s/ouf60n7gdz6cm0g/1kg_dbsnp151_hg38_auto.txt.gz?dl=1',
'recombination_hg19': 'https://www.dropbox.com/s/4b39o8m58m5yvqx/recombination_hg19.tar.gz?dl=1',
'ensembl_hg19_gtf': 'https://ftp.ensembl.org/pub/grch37/current/gtf/homo_sapiens/Homo_sapiens.GRCh37.87.chr.gtf.gz',
'ensembl_hg19_gtf_protein_coding': 'https://www.dropbox.com/s/ovsqf4kyeniakxf/Homo_sapiens.GRCh37.75.protein_coding.gtf.gz?dl=1',
'ensembl_hg38_gtf': 'https://ftp.ensembl.org/pub/current_gtf/homo_sapiens/Homo_sapiens.GRCh38.108.chr.gtf.gz',
'ensembl_hg38_gtf_protein_coding': 'https://www.dropbox.com/s/37kicx0l6uifa1e/Homo_sapiens.GRCh38.107.protein_coding.chr.gtf.gz?dl=1',
'refseq_hg19_gtf': 'https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_genomic.gtf.gz',
'refseq_hg19_gtf_protein_coding': 'https://www.dropbox.com/s/gpzflnid7zo3kca/GRCh37_latest_genomic.protein_coding.gtf.gz?dl=1',
'refseq_hg38_gtf': 'https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gff.gz',
'refseq_hg38_gtf_protein_coding': 'https://www.dropbox.com/s/reojrsjmbwsu0pp/GRCh38_latest_genomic.protein_coding.gtf.gz?dl=1',
'testlink': 'https://www.dropbox.com/s/8u7capwge0ihshu/EAS.chr22.split_norm_af.1kgp3v5.vcf.gz?dl=1',
'testlink_tbi': 'https://www.dropbox.com/s/hdneg53t6u1j6ib/EAS.chr22.split_norm_af.1kgp3v5.vcf.gz.tbi?dl=1'}
download reference using gwaslab¶
gl.download_ref("1kg_eas_hg19")
Sat Nov 5 20:48:33 2022 Start to download 1kg_eas_hg19 ... Sat Nov 5 20:48:33 2022 -Downloading to: /home/yunye/gwaslab/gwaslab/src/gwaslab/data/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz Sat Nov 5 20:53:19 2022 -Updating record in config file... Sat Nov 5 20:53:20 2022 -Updating record in config file... Sat Nov 5 20:53:20 2022 -Downloading to: /home/yunye/gwaslab/gwaslab/src/gwaslab/data/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi Sat Nov 5 20:53:20 2022 Downloaded 1kg_eas_hg19 successfully!
after downloading, use get_path to select the file path
mysumstats.plot_mqq(mode="r",region=(7,156538803,157538803),region_grid=True,vcf_path=gl.get_path("1kg_eas_hg19"))
Sat Nov 5 20:53:20 2022 Start to plot manhattan/qq plot with the following basic settings: Sat Nov 5 20:53:20 2022 -Genome-wide significance level is set to 5e-08 ... Sat Nov 5 20:53:20 2022 -Raw input contains 12557761 variants... Sat Nov 5 20:53:20 2022 -Plot layout mode is : r Sat Nov 5 20:53:20 2022 -Region to plot : chr7:156538803-157538803. Sat Nov 5 20:53:22 2022 -Extract SNPs in region : chr7:156538803-157538803... Sat Nov 5 20:53:50 2022 -Extract SNPs in specified regions: 5831 Sat Nov 5 20:53:50 2022 Finished loading specified columns from the sumstats. Sat Nov 5 20:53:50 2022 Start conversion and sanity check: Sat Nov 5 20:53:50 2022 -Removed 0 variants with nan in CHR or POS column ... Sat Nov 5 20:53:50 2022 -Removed 0 variants with nan in P column ... Sat Nov 5 20:53:50 2022 -P values are being converted to -log10(P)... Sat Nov 5 20:53:50 2022 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed... Sat Nov 5 20:53:50 2022 -Sanity check: 0 na/inf/-inf variants will be removed... Sat Nov 5 20:53:51 2022 -Maximum -log10(P) values is 7.948076083953893 . Sat Nov 5 20:53:51 2022 Finished data conversion and sanity check. Sat Nov 5 20:53:51 2022 Start to load reference genotype... Sat Nov 5 20:53:51 2022 -reference vcf path : /home/yunye/gwaslab/gwaslab/src/gwaslab/data/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz Sat Nov 5 20:53:53 2022 -Retrieving index... 16384 Sat Nov 5 20:53:53 2022 -Calculating Rsq... Sat Nov 5 20:53:53 2022 Finished loading reference genotype successfully! Sat Nov 5 20:53:53 2022 Start to create manhattan plot with 5831 variants: Sat Nov 5 20:53:53 2022 -Loading gtf files from:defualt Sat Nov 5 20:54:06 2022 -plotting gene track.. Sat Nov 5 20:54:06 2022 -Finished plotting gene track.. Sat Nov 5 20:54:06 2022 -Found 1 significant variants with a sliding window size of 500 kb... Sat Nov 5 20:54:06 2022 Finished creating Manhattan plot successfully! Sat Nov 5 20:54:06 2022 -Skip annotating
(<Figure size 3000x2000 with 4 Axes>, <gwaslab.Log.Log at 0x7f985e37bd60>)
or you can provide your own vcf file. Let's annotate the lead variant this time.
mysumstats.plot_mqq(mode="r",region=(7,156538803,157538803),region_grid=True,anno=True,
vcf_path="/home/yunye/mydata/d_disk/eas_1kg_af/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz")
Sat Nov 5 12:53:17 2022 Start to plot manhattan/qq plot with the following basic settings: Sat Nov 5 12:53:17 2022 -Genome-wide significance level is set to 5e-08 ... Sat Nov 5 12:53:17 2022 -Raw input contains 12557761 variants... Sat Nov 5 12:53:17 2022 -Plot layout mode is : r Sat Nov 5 12:53:17 2022 -Region to plot : chr7:156538803-157538803. Sat Nov 5 12:53:19 2022 -Extract SNPs in region : chr7:156538803-157538803... Sat Nov 5 12:53:43 2022 -Extract SNPs in specified regions: 5831 Sat Nov 5 12:53:44 2022 Finished loading specified columns from the sumstats. Sat Nov 5 12:53:44 2022 Start conversion and sanity check: Sat Nov 5 12:53:44 2022 -Removed 0 variants with nan in CHR or POS column ... Sat Nov 5 12:53:44 2022 -Removed 0 variants with nan in P column ... Sat Nov 5 12:53:44 2022 -P values are being converted to -log10(P)... Sat Nov 5 12:53:44 2022 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed... Sat Nov 5 12:53:44 2022 -Sanity check: 0 na/inf/-inf variants will be removed... Sat Nov 5 12:53:44 2022 -Maximum -log10(P) values is 7.948076083953893 . Sat Nov 5 12:53:44 2022 Finished data conversion and sanity check. Sat Nov 5 12:53:44 2022 Start to load reference genotype... Sat Nov 5 12:53:44 2022 -reference vcf path : /home/yunye/mydata/d_disk/eas_1kg_af/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz Sat Nov 5 12:53:45 2022 -Retrieving index... 16384 Sat Nov 5 12:53:46 2022 -Calculating Rsq... Sat Nov 5 12:53:46 2022 Finished loading reference genotype successfully! Sat Nov 5 12:53:46 2022 Start to create manhattan plot with 5831 variants: Sat Nov 5 12:53:46 2022 -Loading gtf files from:defualt Sat Nov 5 12:53:55 2022 -plotting gene track.. Sat Nov 5 12:53:55 2022 -Finished plotting gene track.. Sat Nov 5 12:53:55 2022 -Found 1 significant variants with a sliding window size of 500 kb... Sat Nov 5 12:53:55 2022 Finished creating Manhattan plot successfully! Sat Nov 5 12:53:55 2022 -Annotating using column CHR:POS...
(<Figure size 3000x2000 with 4 Axes>, <gwaslab.Log.Log at 0x7f04648d44f0>)
Note: gwaslab default genome build version is build="19" (GRCh37/hg19), you can change it to build="38" (GRCh38/hg38) when needed. For gene tracks, default is gtf_path="ensembl" , you can also use gtf_path="refseq" (NCBA RefSeq)
Sampling¶
There are more than 10 million variants in the original sumstats and it will take long to process the entrie dataset. Since it might take a while to process the entire datasets, let us just random sample 1 million variants for this tutorial.
mysumstats.random_variants(n=100000,inplace=True)
Infer genome build¶
In case you don't know the genome build of the sumstats
For details, see: https://cloufield.github.io/gwaslab/InferBuild/
mysumstats.infer_build()
Fix_id¶
You may notice that the SNPID is in CHR:POS_REF_ALT format. We want SNPID to be in a stadardized format chr:pos:ref:alt, we can use fix_id for this:
For other options of standardization, see: https://cloufield.github.io/gwaslab/Standardization/
#fixsep : fix ID separator
mysumstats.fix_id(fixsep=True)
mysumstats.data
Harmonise¶
gwaslab can harmonize the sumstats based on reference files.
- ref_seq : reference genome fasta file for allele alignment
- ref_rsid_tsv : tsv file for annotation of common used variants
- ref_rsid_vcf : vcf file for annotation of other variants
- ref_infer : vcf file with allele frequency information for inferring strand and comparing allele frequency
- ref_alt_freq : field in INFO of vcf file for alternative allele frequency
For details see: https://cloufield.github.io/gwaslab/Harmonization/
For reference data, see: https://cloufield.github.io/gwaslab/Reference/
mysumstats.harmonize(basic_check=False,
n_cores=3,
ref_seq="/Users/he/Documents/Mydata/human_g1k_v37.fasta",
ref_rsid_tsv="/Users/he/Documents/Mydata/EAS_1kg_af_dbsnp151.ALL.tsv",
ref_rsid_vcf="/Users/he/Documents/Mydata/All_20180423.vcf.gz",
ref_infer="/Users/he/Documents/Mydata/eas_1kg_af/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz",ref_alt_freq="AF")
Wed Nov 2 13:28:01 2022 Start to check if NEA is aligned with reference sequence... Wed Nov 2 13:28:01 2022 -Current Dataframe shape : 100000 x 12 Wed Nov 2 13:28:01 2022 -Reference genome fasta file: /Users/he/Documents/Mydata/human_g1k_v37.fasta Wed Nov 2 13:28:01 2022 -Checking records: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y MT Wed Nov 2 13:28:39 2022 -Variants allele on given reference sequence : 41568 Wed Nov 2 13:28:39 2022 -Variants flipped : 50063 Wed Nov 2 13:28:39 2022 -Raw Matching rate : 91.63% Wed Nov 2 13:28:39 2022 -Variants inferred reverse_complement : 0 Wed Nov 2 13:28:39 2022 -Variants inferred reverse_complement_flipped : 0 Wed Nov 2 13:28:39 2022 -Both allele on genome + unable to distinguish : 8369 Wed Nov 2 13:28:39 2022 -Variants not on given reference sequence : 0 Wed Nov 2 13:28:39 2022 -Current Dataframe shape : 100000 x 12 Wed Nov 2 13:28:40 2022 Start to flip allele-specific stats for SNPs with status xxxxx[35]x: alt->ea , ref->nea ... Wed Nov 2 13:28:40 2022 -Flipping 50063 variants... Wed Nov 2 13:28:40 2022 -Swapping column: NEA <=> EA... Wed Nov 2 13:28:40 2022 -Flipping column: BETA = - BETA... Wed Nov 2 13:28:40 2022 -Flipping column: EAF = 1 - EAF... Wed Nov 2 13:28:40 2022 -Flipping column: DIRECTION +-? <=> -+? ... Wed Nov 2 13:28:40 2022 -Changed the status for flipped variants : xxxxx[35]x -> xxxxx[12]x Wed Nov 2 13:28:41 2022 Finished converting successfully! Wed Nov 2 13:28:41 2022 Start to infer strand for palindromic SNPs... Wed Nov 2 13:28:41 2022 -Current Dataframe shape : 100000 x 12 Wed Nov 2 13:28:41 2022 -Reference vcf file: /Users/he/Documents/Mydata/eas_1kg_af/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz Wed Nov 2 13:28:41 2022 -Alternative allele frequency in INFO: AF Wed Nov 2 13:28:42 2022 -Identified 14159 palindromic SNPs... Wed Nov 2 13:28:42 2022 -After filtering by MAF< 0.4 , the strand of 12908 palindromic SNPs will be inferred... Wed Nov 2 13:29:07 2022 -Non-palindromic : 76762 Wed Nov 2 13:29:07 2022 -Palindromic SNPs on + strand: 12520 Wed Nov 2 13:29:08 2022 -Palindromic SNPs on - strand and need to be flipped: 24 Wed Nov 2 13:29:08 2022 -Palindromic SNPs with maf not availble to infer : 1251 Wed Nov 2 13:29:08 2022 -Palindromic SNPs with no macthes or no information : 175 Wed Nov 2 13:29:09 2022 -Identified 8369 indistinguishable Indels... Wed Nov 2 13:29:09 2022 -Indistinguishable Indels will be inferred from reference vcf ref and alt... Wed Nov 2 13:29:36 2022 -Indels ea/nea match reference : 3538 Wed Nov 2 13:29:37 2022 -Indels ea/nea need to be flipped : 4531 Wed Nov 2 13:29:37 2022 -Indels with no macthes or no information : 300 Wed Nov 2 13:29:37 2022 -Current Dataframe shape : 100000 x 12 Wed Nov 2 13:29:38 2022 Start to flip allele-specific stats for standardized indels with status xxxx[123][67][6]: alt->ea , ref->nea ... Wed Nov 2 13:29:39 2022 -Flipping 4531 variants... Wed Nov 2 13:29:39 2022 -Swapping column: NEA <=> EA... Wed Nov 2 13:29:39 2022 -Flipping column: BETA = - BETA... Wed Nov 2 13:29:39 2022 -Flipping column: EAF = 1 - EAF... Wed Nov 2 13:29:39 2022 -Flipping column: DIRECTION +-? <=> -+? ... Wed Nov 2 13:29:39 2022 -Changed the status for flipped variants xxxx[123][67]6 -> xxxx[123][67]4 Wed Nov 2 13:29:39 2022 Start to flip allele-specific stats for palindromic SNPs with status xxxxx[12]5: (-)strand <=> (+)strand ... Wed Nov 2 13:29:40 2022 -Flipping 24 variants... Wed Nov 2 13:29:40 2022 -Flipping column: BETA = - BETA... Wed Nov 2 13:29:40 2022 -Flipping column: EAF = 1 - EAF... Wed Nov 2 13:29:40 2022 -Flipping column: DIRECTION +-? <=> -+? ... Wed Nov 2 13:29:40 2022 -Changed the status for flipped variants: xxxxx[012]5: -> xxxxx[012]2 Wed Nov 2 13:29:40 2022 Finished converting successfully! Wed Nov 2 13:29:40 2022 Start to annotate rsID based on chromosome and position information... Wed Nov 2 13:29:40 2022 -Current Dataframe shape : 100000 x 12 Wed Nov 2 13:29:40 2022 -SNPID-rsID text file: /Users/he/Documents/Mydata/EAS_1kg_af_dbsnp151.ALL.tsv Wed Nov 2 13:29:40 2022 -100000 rsID could be possibly fixed... Wed Nov 2 13:29:40 2022 -Setting block size: 5000000 Wed Nov 2 13:29:40 2022 -Loading block: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Wed Nov 2 13:35:24 2022 -rsID Annotation for 3296 need to be fixed! Wed Nov 2 13:35:24 2022 -Annotated 96704 rsID successfully! Wed Nov 2 13:35:24 2022 Start to assign rsID using vcf... Wed Nov 2 13:35:24 2022 -Current Dataframe shape : 100000 x 13 Wed Nov 2 13:35:24 2022 -CPU Cores to use : 3 Wed Nov 2 13:35:24 2022 -Reference VCF file: /Users/he/Documents/Mydata/All_20180423.vcf.gz Wed Nov 2 13:35:24 2022 -Assigning rsID based on chr:pos and ref:alt/alt:ref... Wed Nov 2 13:35:41 2022 -rsID Annotation for 618 need to be fixed! Wed Nov 2 13:35:41 2022 -Annotated 2678 rsID successfully! Wed Nov 2 13:35:41 2022 Start to sort the genome coordinates... Wed Nov 2 13:35:41 2022 -Current Dataframe shape : 100000 x 13 Wed Nov 2 13:35:41 2022 -Force converting POS to integers... Wed Nov 2 13:35:41 2022 -Sorting genome coordinates... Wed Nov 2 13:35:42 2022 Finished sorting genome coordinates successfully! Wed Nov 2 13:35:42 2022 Start to reorder the columns... Wed Nov 2 13:35:42 2022 -Current Dataframe shape : 100000 x 13 Wed Nov 2 13:35:42 2022 -Reordering columns to : SNPID,rsID,CHR,POS,EA,NEA,EAF,BETA,SE,P,N,DIRECTION,STATUS Wed Nov 2 13:35:42 2022 Finished sorting columns successfully!
<gwaslab.Sumstats.Sumstats at 0x7fee8400ff10>
Check the data again. Looks good!
mysumstats.data
| SNPID | rsID | CHR | POS | EA | NEA | EAF | BETA | SE | P | N | DIRECTION | STATUS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1:854168:C:T | rs79188446 | 1 | 854168 | T | C | 0.1484 | -0.0292 | 0.0156 | 0.06173 | 166718 | -?-+ | 1960010 |
| 1 | 1:856510:C:T | rs191530247 | 1 | 856510 | T | C | 0.0045 | 0.0604 | 0.1278 | 0.63650 | 166718 | +?+- | 1960010 |
| 2 | 1:981131:A:G | rs9697293 | 1 | 981131 | G | A | 0.0330 | 0.0183 | 0.0273 | 0.50200 | 166718 | -?+- | 1960000 |
| 3 | 1:1021408:G:A | rs11260587 | 1 | 1021408 | A | G | 0.0010 | -0.1388 | 0.2623 | 0.59680 | 166718 | -?+- | 1960010 |
| 4 | 1:1038997:ACTC:A | rs71576598 | 1 | 1038997 | A | ACTC | 0.2037 | 0.0162 | 0.0111 | 0.14410 | 191764 | +++- | 1960364 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 99995 | X:154429214:A:G | rs4893075 | 23 | 154429214 | G | A | 0.1802 | 0.0063 | 0.0096 | 0.51160 | 191764 | +-+- | 1960000 |
| 99996 | X:154452019:G:A | rs113007852 | 23 | 154452019 | A | G | 0.0143 | 0.0374 | 0.0437 | 0.39210 | 191764 | +-++ | 1960010 |
| 99997 | X:154682147:C:A | rs5940511 | 23 | 154682147 | A | C | 0.7852 | -0.0081 | 0.0084 | 0.33460 | 191764 | -+-+ | 1960010 |
| 99998 | X:154740552:T:C | rs5940461 | 23 | 154740552 | C | T | 0.7972 | -0.0086 | 0.0090 | 0.33840 | 191764 | -+-+ | 1960000 |
| 99999 | X:154803312:C:T | rs781981622 | 23 | 154803312 | T | C | 0.6074 | -0.0080 | 0.0115 | 0.48580 | 191764 | -+-+ | 1960010 |
100000 rows × 13 columns
Check the summary of the currrent sumstats (see: https://cloufield.github.io/gwaslab/StatusCode/):
mysumstats.summary()
| Values | Percentage | ||
|---|---|---|---|
| Category | Items | ||
| META | Row_num | 100000 | NaN |
| Column_num | 6 | NaN | |
| Column_names | SNPID,rsID,EAF,P,N,STATUS | NaN | |
| Last_checked_time | Wed Nov 2 13:35:42 2022 | NaN | |
| MISSING | Missing_total | 618 | 0.62 |
| Missing_rsID | 618 | 0.62 | |
| MAF | Common | 50567 | 50.57 |
| Low_frequency | 17297 | 17.30 | |
| Rare | 32073 | 32.07 | |
| P | Minimum | 6.973e-74 | 0.00 |
| Significant | 73 | 0.07 | |
| Suggestive | 140 | 0.14 | |
| STATUS | 1960010 | 42746 | 42.75 |
| 1960000 | 34016 | 34.02 | |
| 1960011 | 6324 | 6.32 | |
| 1960001 | 6196 | 6.20 | |
| 1960364 | 4531 | 4.53 | |
| 1960363 | 3538 | 3.54 | |
| 1960007 | 629 | 0.63 | |
| 1960017 | 622 | 0.62 | |
| 1960309 | 529 | 0.53 | |
| 1960368 | 300 | 0.30 | |
| 1960008 | 189 | 0.19 | |
| 1960319 | 181 | 0.18 | |
| 1960018 | 175 | 0.18 | |
| 1960012 | 15 | 0.02 | |
| 1960002 | 9 | 0.01 |
Formatting and saving : to_format()¶
You can easily format the processed sumstats and save it.
For details see: https://cloufield.github.io/gwaslab/Format/
mysumstats.to_format("clean_sumstats",fmt="ldsc")
Wed Nov 2 13:35:42 2022 Start to format the output sumstats in: ldsc format
Wed Nov 2 13:35:42 2022 -Formatting statistics ...
Wed Nov 2 13:35:42 2022 - Float statistics formats:
Wed Nov 2 13:35:42 2022 - Columns: ['EAF', 'BETA', 'SE', 'P']
Wed Nov 2 13:35:42 2022 - Output formats: ['{:.4g}', '{:.4f}', '{:.4f}', '{:.4e}']
Wed Nov 2 13:35:42 2022 - Start outputting sumstats in ldsc format...
Wed Nov 2 13:35:42 2022 -ldsc format will be loaded...
Wed Nov 2 13:35:42 2022 -ldsc format meta info:
Wed Nov 2 13:35:42 2022 - format_name : ldsc
Wed Nov 2 13:35:42 2022 - format_source : https://github.com/bulik/ldsc/wiki/Summary-Statistics-File-Format
Wed Nov 2 13:35:42 2022 - format_source2 : https://github.com/bulik/ldsc/blob/master/munge_sumstats.py
Wed Nov 2 13:35:42 2022 - format_version : 20150306
Wed Nov 2 13:35:42 2022 -gwaslab to ldsc format dictionary:
Wed Nov 2 13:35:42 2022 - gwaslab keys: ['rsID', 'NEA', 'EA', 'N', 'BETA', 'P', 'INFO', 'OR']
Wed Nov 2 13:35:42 2022 - ldsc values: ['SNP', 'A2', 'A1', 'N', 'Beta', 'P', 'INFO', 'OR']
Wed Nov 2 13:35:42 2022 -Output columns: Index(['SNP', 'A1', 'A2', 'Beta', 'P', 'N'], dtype='object')
Wed Nov 2 13:35:42 2022 -Output path: clean_sumstats.ldsc.tsv.gz
Wed Nov 2 13:35:44 2022 -Saving log file: clean_sumstats.ldsc.log
Wed Nov 2 13:35:44 2022 Finished outputting successfully!
Liftover¶
If the sumstats need liftover:
mysumstats.liftover(n_cores=1,from_build="19", to_build="38")
Wed Nov 2 13:49:15 2022 Start to perform liftover... Wed Nov 2 13:49:15 2022 -Current Dataframe shape : 100000 x 13 Wed Nov 2 13:49:15 2022 -CPU Cores to use : 1 Wed Nov 2 13:49:15 2022 -Performing liftover ... Wed Nov 2 13:49:15 2022 -Creating converter : hg19 to hg38 Wed Nov 2 13:49:16 2022 -Converting variants with status code xxx0xxx :100000... Wed Nov 2 13:49:50 2022 -Removed unmapped variants: 53 Wed Nov 2 13:49:52 2022 Finished liftover successfully!
mysumstats.data
| SNPID | rsID | CHR | POS | EA | NEA | EAF | BETA | SE | P | N | DIRECTION | STATUS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1:854168:C:T | rs79188446 | 1 | 918788 | T | C | 0.1484 | -0.0292 | 0.0156 | 0.06173 | 166718 | -?-+ | 3860099 |
| 1 | 1:856510:C:T | rs191530247 | 1 | 921130 | T | C | 0.0045 | 0.0604 | 0.1278 | 0.63650 | 166718 | +?+- | 3860099 |
| 2 | 1:981131:A:G | rs9697293 | 1 | 1045751 | G | A | 0.0330 | 0.0183 | 0.0273 | 0.50200 | 166718 | -?+- | 3860099 |
| 3 | 1:1021408:G:A | rs11260587 | 1 | 1086028 | A | G | 0.0010 | -0.1388 | 0.2623 | 0.59680 | 166718 | -?+- | 3860099 |
| 4 | 1:1038997:ACTC:A | rs71576598 | 1 | 1103617 | A | ACTC | 0.2037 | 0.0162 | 0.0111 | 0.14410 | 191764 | +++- | 3860399 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 99995 | X:154429214:A:G | rs4893075 | 23 | 155200937 | G | A | 0.1802 | 0.0063 | 0.0096 | 0.51160 | 191764 | +-+- | 3860099 |
| 99996 | X:154452019:G:A | rs113007852 | 23 | 155223738 | A | G | 0.0143 | 0.0374 | 0.0437 | 0.39210 | 191764 | +-++ | 3860099 |
| 99997 | X:154682147:C:A | rs5940511 | 23 | 155452486 | A | C | 0.7852 | -0.0081 | 0.0084 | 0.33460 | 191764 | -+-+ | 3860099 |
| 99998 | X:154740552:T:C | rs5940461 | 23 | 155510891 | C | T | 0.7972 | -0.0086 | 0.0090 | 0.33840 | 191764 | -+-+ | 3860099 |
| 99999 | X:154803312:C:T | rs781981622 | 23 | 155573651 | T | C | 0.6074 | -0.0080 | 0.0115 | 0.48580 | 191764 | -+-+ | 3860099 |
99947 rows × 13 columns
Note: Gwaslab only liftover CHR and POS, and when lifted, the last two digits status code will be rolled back to 99. Since for difference reference genome, the reference allele or strand might be reverse, so it is need to align and check agin.
For details, see https://cloufield.github.io/gwaslab/LiftOver/